Real-world Data WranglingΒΆ

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess the data programmatically and visually across elements of data quality and structure, and implement a cleaning strategy for the data. You will then store the updated data into your selected database/data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

  1. Explain your decisions towards methods used for gathering, assessing, cleaning, storing, and answering the research question.
  2. Write code comments so your code is more readable.

Problem Statement:ΒΆ

In this notebook, we aim to perform exploratory data analysis (EDA) and clean two datasets related to video game sales and Steam game statistics. The Video Game Sales dataset contains information about global sales of games across different platforms, while the Steam Games dataset contains metadata, user ratings, and gameplay statistics for games available on Steam.

The key objectives of this analysis are:

  1. Data Cleaning and Quality Issues:

    • Address missing values, inconsistent data types, and remove redundant columns.
    • Ensure that the data is properly formatted to allow for accurate analysis.
  2. Tidiness Issues:

    • Split columns that contain multiple variables (e.g., platforms and categories) into separate, one-hot encoded columns to adhere to tidy data principles.
    • Ensure the dataset follows a structure where each variable has its own column and each observation is in its own row.
  3. Exploratory Data Analysis (EDA):

    • Analyze trends in video game sales across different regions and platforms.
    • Examine user ratings, playtime statistics, and the relationship between game price and user feedback on Steam.

Through this process, we aim to gain insights into the video game market across platforms and regions, as well as understand patterns in game performance on the Steam platform.

1.2. Gather at least two datasets using two different data gathering methodsΒΆ

List of data gathering methods:

  • Download data manually
  • Programmatically downloading files
  • Gather data by accessing APIs
  • Gather and extract data from HTML files using BeautifulSoup
  • Extract data from a SQL database

Each dataset must have at least two variables, and have greater than 500 data samples within each dataset.

For each dataset, briefly describe why you picked the dataset and the gathering method (2-3 full sentences), including the names and significance of the variables in the dataset. Show your work (e.g., if using an API to download the data, please include a snippet of your code).

Load the dataset programmatically into this notebook.

Dataset 1ΒΆ

Type: CSV File

Method: Manually downloaded the Video Game Sales dataset from Kaggle (vgsales.csv) using the download button.

Dataset variables:

  • Rank - Ranking of overall sales
  • Name - The games name
  • Platform - Platform of the games release (i.e. PC,PS4, etc.)
  • Year - Year of the game's release
  • Genre - Genre of the game
  • Publisher - Publisher of the game
  • NA_Sales - Sales in North America (in millions)
  • EU_Sales - Sales in Europe (in millions)
  • JP_Sales - Sales in Japan (in millions)
  • Other_Sales - Sales in the rest of the world (in millions)
  • Global_Sales - Total worldwide sales.
InΒ [1]:
# Ensure plotly renders inline
import plotly.io as pio
pio.renderers.default = 'notebook'

# Downloaded Manually. See vgsales.csv
import pandas as pd

# Load dataset
vgsales_df = pd.read_csv('vgsales.csv')
print(vgsales_df.head())
   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo   
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo   
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo   
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.49     29.02      3.77         8.46         82.74  
1     29.08      3.58      6.81         0.77         40.24  
2     15.85     12.88      3.79         3.31         35.82  
3     15.75     11.01      3.28         2.96         33.00  
4     11.27      8.89     10.22         1.00         31.37  

Dataset 2ΒΆ

Type: CSV File

Method: Using python, programmatically downloaded and unzipped file.

Dataset variables:

  • appid - unique identifier to differentiate between original and rerelease/remaster
  • name - The game's name
  • release_date - Date of game's release
  • english - binary yes/no for english language
  • developer - developer name
  • publisher - publisher name
  • platforms - available platforms for the game
  • required_age - age requirement, 0 if not applicable or not available
  • categories
  • genres - game genres
  • steamspy_tags -
  • achievements - available achievements for the game
  • positive_ratings - number of positive ratings
  • negative_ratings - number of positive ratings
  • average_playtime - average playtime
  • median_playtime - average playtime
  • owners - number of people that own the game
  • price - price of game
InΒ [2]:
# Code found in datapull.py

# Load dataset
steam_df = pd.read_csv('steam.csv')

2. Assess dataΒΆ

Assess the data according to data quality and tidiness metrics using the report below.

List two data quality issues and two tidiness issues. Assess each data issue visually and programmatically, then briefly describe the issue you find. Make sure you include justifications for the methods you use for the assessment.

Quality Issue 1:ΒΆ

InΒ [3]:
# Inspecting dataframes visually
# Print first few rows and dataframe info for each 
print(vgsales_df.head())
vgsales_df.info()

print(steam_df.head())
steam_df.info()
   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo   
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo   
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo   
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.49     29.02      3.77         8.46         82.74  
1     29.08      3.58      6.81         0.77         40.24  
2     15.85     12.88      3.79         3.31         35.82  
3     15.75     11.01      3.28         2.96         33.00  
4     11.27      8.89     10.22         1.00         31.37  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
   appid                       name release_date  english         developer  \
0     10             Counter-Strike   2000-11-01        1             Valve   
1     20      Team Fortress Classic   1999-04-01        1             Valve   
2     30              Day of Defeat   2003-05-01        1             Valve   
3     40         Deathmatch Classic   2001-06-01        1             Valve   
4     50  Half-Life: Opposing Force   1999-11-01        1  Gearbox Software   

  publisher          platforms  required_age  \
0     Valve  windows;mac;linux             0   
1     Valve  windows;mac;linux             0   
2     Valve  windows;mac;linux             0   
3     Valve  windows;mac;linux             0   
4     Valve  windows;mac;linux             0   

                                          categories  genres  \
0  Multi-player;Online Multi-Player;Local Multi-P...  Action   
1  Multi-player;Online Multi-Player;Local Multi-P...  Action   
2              Multi-player;Valve Anti-Cheat enabled  Action   
3  Multi-player;Online Multi-Player;Local Multi-P...  Action   
4  Single-player;Multi-player;Valve Anti-Cheat en...  Action   

                  steamspy_tags  achievements  positive_ratings  \
0        Action;FPS;Multiplayer             0            124534   
1        Action;FPS;Multiplayer             0              3318   
2  FPS;World War II;Multiplayer             0              3416   
3        Action;FPS;Multiplayer             0              1273   
4             FPS;Action;Sci-fi             0              5250   

   negative_ratings  average_playtime  median_playtime             owners  \
0              3339             17612              317  10000000-20000000   
1               633               277               62   5000000-10000000   
2               398               187               34   5000000-10000000   
3               267               258              184   5000000-10000000   
4               288               624              415   5000000-10000000   

   price  
0   7.19  
1   3.99  
2   3.99  
3   3.99  
4   3.99  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   appid             27075 non-null  int64  
 1   name              27075 non-null  object 
 2   release_date      27075 non-null  object 
 3   english           27075 non-null  int64  
 4   developer         27074 non-null  object 
 5   publisher         27061 non-null  object 
 6   platforms         27075 non-null  object 
 7   required_age      27075 non-null  int64  
 8   categories        27075 non-null  object 
 9   genres            27075 non-null  object 
 10  steamspy_tags     27075 non-null  object 
 11  achievements      27075 non-null  int64  
 12  positive_ratings  27075 non-null  int64  
 13  negative_ratings  27075 non-null  int64  
 14  average_playtime  27075 non-null  int64  
 15  median_playtime   27075 non-null  int64  
 16  owners            27075 non-null  object 
 17  price             27075 non-null  float64
dtypes: float64(1), int64(8), object(9)
memory usage: 3.7+ MB
InΒ [4]:
# Check summary of dataset
print("\nDataset information:")
vgsales_df.info()

# Check for missing values
print("\nMissing values in each column:")
missing_data = vgsales_df.isnull().sum()
print(missing_data)

# Display basic statistics
print("\nStatistical summary:")
print(vgsales_df.describe())

# Check summary of dataset
print("\nDataset information:")
steam_df.info()

# Check for missing values
print("\nMissing values in each column:")
missing_data = steam_df.isnull().sum()
print(missing_data)

# Display basic statistics
print("\nStatistical summary:")
print(steam_df.describe())
Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB

Missing values in each column:
Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

Statistical summary:
               Rank          Year      NA_Sales      EU_Sales      JP_Sales  \
count  16598.000000  16327.000000  16598.000000  16598.000000  16598.000000   
mean    8300.605254   2006.406443      0.264667      0.146652      0.077782   
std     4791.853933      5.828981      0.816683      0.505351      0.309291   
min        1.000000   1980.000000      0.000000      0.000000      0.000000   
25%     4151.250000   2003.000000      0.000000      0.000000      0.000000   
50%     8300.500000   2007.000000      0.080000      0.020000      0.000000   
75%    12449.750000   2010.000000      0.240000      0.110000      0.040000   
max    16600.000000   2020.000000     41.490000     29.020000     10.220000   

        Other_Sales  Global_Sales  
count  16598.000000  16598.000000  
mean       0.048063      0.537441  
std        0.188588      1.555028  
min        0.000000      0.010000  
25%        0.000000      0.060000  
50%        0.010000      0.170000  
75%        0.040000      0.470000  
max       10.570000     82.740000  

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   appid             27075 non-null  int64  
 1   name              27075 non-null  object 
 2   release_date      27075 non-null  object 
 3   english           27075 non-null  int64  
 4   developer         27074 non-null  object 
 5   publisher         27061 non-null  object 
 6   platforms         27075 non-null  object 
 7   required_age      27075 non-null  int64  
 8   categories        27075 non-null  object 
 9   genres            27075 non-null  object 
 10  steamspy_tags     27075 non-null  object 
 11  achievements      27075 non-null  int64  
 12  positive_ratings  27075 non-null  int64  
 13  negative_ratings  27075 non-null  int64  
 14  average_playtime  27075 non-null  int64  
 15  median_playtime   27075 non-null  int64  
 16  owners            27075 non-null  object 
 17  price             27075 non-null  float64
dtypes: float64(1), int64(8), object(9)
memory usage: 3.7+ MB

Missing values in each column:
appid                0
name                 0
release_date         0
english              0
developer            1
publisher           14
platforms            0
required_age         0
categories           0
genres               0
steamspy_tags        0
achievements         0
positive_ratings     0
negative_ratings     0
average_playtime     0
median_playtime      0
owners               0
price                0
dtype: int64

Statistical summary:
              appid       english  required_age  achievements  \
count  2.707500e+04  27075.000000  27075.000000  27075.000000   
mean   5.962035e+05      0.981127      0.354903     45.248864   
std    2.508942e+05      0.136081      2.406044    352.670281   
min    1.000000e+01      0.000000      0.000000      0.000000   
25%    4.012300e+05      1.000000      0.000000      0.000000   
50%    5.990700e+05      1.000000      0.000000      7.000000   
75%    7.987600e+05      1.000000      0.000000     23.000000   
max    1.069460e+06      1.000000     18.000000   9821.000000   

       positive_ratings  negative_ratings  average_playtime  median_playtime  \
count      2.707500e+04      27075.000000      27075.000000      27075.00000   
mean       1.000559e+03        211.027147        149.804949        146.05603   
std        1.898872e+04       4284.938531       1827.038141       2353.88008   
min        0.000000e+00          0.000000          0.000000          0.00000   
25%        6.000000e+00          2.000000          0.000000          0.00000   
50%        2.400000e+01          9.000000          0.000000          0.00000   
75%        1.260000e+02         42.000000          0.000000          0.00000   
max        2.644404e+06     487076.000000     190625.000000     190625.00000   

              price  
count  27075.000000  
mean       6.078193  
std        7.874922  
min        0.000000  
25%        1.690000  
50%        3.990000  
75%        7.190000  
max      421.990000  

Data Quality Issues:ΒΆ

  1. Missing Values:

    • In the Video Game Sales dataset:
      • Year column has 271 missing values.
      • Publisher column has 58 missing values.
    • In the Steam Games dataset:
      • developer column has 1 missing value.
      • publisher column has 14 missing values.
  2. Inconsistent Data Types:

    • In the Steam Games dataset, the release_date column is an object (string), but it should be a datetime type for proper time-based analysis.
    • In the Video Game Sales dataset, the Year column is a float64, which should be converted to an integer or date-based format for clarity.

Tidiness Issues:ΒΆ

  1. Redundant Global_Sales Column:

    • In the Video Game Sales dataset, the Global_Sales column is a sum of the regional sales (NA_Sales, EU_Sales, JP_Sales, Other_Sales). This could be derived and is redundant, however, we are not interested in
  2. Multiple Variables in a Single Column:

    • In the Steam Games dataset, the platforms and categories columns contain multiple values (separated by semicolons). Each of these should ideally be split into individual columns for proper analysis.

3. Clean dataΒΆ

Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. Make sure you include justifications for your cleaning decisions.

After the cleaning for each issue, please use either the visually or programmatically method to validate the cleaning was successful.

InΒ [5]:
# Load datasets
vgsales_df = pd.read_csv('vgsales.csv')
steam_df = pd.read_csv('steam.csv')

# Make copies of datasets
vgsales_copy = vgsales_df.copy()
steam_copy = steam_df.copy()

# Check first few rows to confirm data is loaded correctly
print(vgsales_copy.head())
print(steam_copy.head())
   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo   
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo   
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo   
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.49     29.02      3.77         8.46         82.74  
1     29.08      3.58      6.81         0.77         40.24  
2     15.85     12.88      3.79         3.31         35.82  
3     15.75     11.01      3.28         2.96         33.00  
4     11.27      8.89     10.22         1.00         31.37  
   appid                       name release_date  english         developer  \
0     10             Counter-Strike   2000-11-01        1             Valve   
1     20      Team Fortress Classic   1999-04-01        1             Valve   
2     30              Day of Defeat   2003-05-01        1             Valve   
3     40         Deathmatch Classic   2001-06-01        1             Valve   
4     50  Half-Life: Opposing Force   1999-11-01        1  Gearbox Software   

  publisher          platforms  required_age  \
0     Valve  windows;mac;linux             0   
1     Valve  windows;mac;linux             0   
2     Valve  windows;mac;linux             0   
3     Valve  windows;mac;linux             0   
4     Valve  windows;mac;linux             0   

                                          categories  genres  \
0  Multi-player;Online Multi-Player;Local Multi-P...  Action   
1  Multi-player;Online Multi-Player;Local Multi-P...  Action   
2              Multi-player;Valve Anti-Cheat enabled  Action   
3  Multi-player;Online Multi-Player;Local Multi-P...  Action   
4  Single-player;Multi-player;Valve Anti-Cheat en...  Action   

                  steamspy_tags  achievements  positive_ratings  \
0        Action;FPS;Multiplayer             0            124534   
1        Action;FPS;Multiplayer             0              3318   
2  FPS;World War II;Multiplayer             0              3416   
3        Action;FPS;Multiplayer             0              1273   
4             FPS;Action;Sci-fi             0              5250   

   negative_ratings  average_playtime  median_playtime             owners  \
0              3339             17612              317  10000000-20000000   
1               633               277               62   5000000-10000000   
2               398               187               34   5000000-10000000   
3               267               258              184   5000000-10000000   
4               288               624              415   5000000-10000000   

   price  
0   7.19  
1   3.99  
2   3.99  
3   3.99  
4   3.99  

Quality IssuesΒΆ

InΒ [6]:
# Handling data quality issue in vgsales_copy (Video Game Sales dataset)
# Drop 'Year' column
vgsales_copy = vgsales_copy.drop(columns=['Year'])

# Fill missing values in 'Publisher' with 'Unknown'
vgsales_copy['Publisher'] = vgsales_copy['Publisher'].fillna('Unknown')

# Ensure float columns are correctly formatted & round decimals to 2 places
sales_columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
vgsales_copy[sales_columns] = vgsales_copy[sales_columns].round(2)

# Handling data quality issue in steam_copy (Steam Games dataset)
# Fill missing values in 'developer' and 'publisher' columns with 'Unknown'
steam_copy['developer'] = steam_copy['developer'].fillna('Unknown')
steam_copy['publisher'] = steam_copy['publisher'].fillna('Unknown')

# Convert 'release_date' column to datetime format
steam_copy['release_date'] = pd.to_datetime(steam_copy['release_date'], errors='coerce')
InΒ [7]:
# Check for missing values in vgsales_copy
print(vgsales_copy.isnull().sum())

# Check for missing values in steam_copy
print(steam_copy.isnull().sum())

# Display dtypes to confirm correct types
print(steam_copy.dtypes)
Rank            0
Name            0
Platform        0
Genre           0
Publisher       0
NA_Sales        0
EU_Sales        0
JP_Sales        0
Other_Sales     0
Global_Sales    0
dtype: int64
appid               0
name                0
release_date        0
english             0
developer           0
publisher           0
platforms           0
required_age        0
categories          0
genres              0
steamspy_tags       0
achievements        0
positive_ratings    0
negative_ratings    0
average_playtime    0
median_playtime     0
owners              0
price               0
dtype: int64
appid                        int64
name                        object
release_date        datetime64[ns]
english                      int64
developer                   object
publisher                   object
platforms                   object
required_age                 int64
categories                  object
genres                      object
steamspy_tags               object
achievements                 int64
positive_ratings             int64
negative_ratings             int64
average_playtime             int64
median_playtime              int64
owners                      object
price                      float64
dtype: object

Justification: Data formats now match and there are no missing values.

Tidiness IssuesΒΆ

InΒ [8]:
# Addressing tidiness issue #2: Splitting 'platforms' and 'categories' columns in steam_copy
# Splitting 'platforms' into separate columns
platforms_expanded = steam_copy['platforms'].str.get_dummies(sep=';')
steam_copy = pd.concat([steam_copy, platforms_expanded], axis=1)

# Splitting 'categories' into separate columns
categories_expanded = steam_copy['categories'].str.get_dummies(sep=';')
steam_copy = pd.concat([steam_copy, categories_expanded], axis=1)

# Drop original 'platforms' and 'categories' columns since they have been expanded and are now redundant
steam_copy = steam_copy.drop(columns=['platforms', 'categories'])
InΒ [9]:
# Checking modified dataframes
print(vgsales_copy.head())
print(steam_copy.head())
   Rank                      Name Platform         Genre Publisher  NA_Sales  \
0     1                Wii Sports      Wii        Sports  Nintendo     41.49   
1     2         Super Mario Bros.      NES      Platform  Nintendo     29.08   
2     3            Mario Kart Wii      Wii        Racing  Nintendo     15.85   
3     4         Wii Sports Resort      Wii        Sports  Nintendo     15.75   
4     5  Pokemon Red/Pokemon Blue       GB  Role-Playing  Nintendo     11.27   

   EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     29.02      3.77         8.46         82.74  
1      3.58      6.81         0.77         40.24  
2     12.88      3.79         3.31         35.82  
3     11.01      3.28         2.96         33.00  
4      8.89     10.22         1.00         31.37  
   appid                       name release_date  english         developer  \
0     10             Counter-Strike   2000-11-01        1             Valve   
1     20      Team Fortress Classic   1999-04-01        1             Valve   
2     30              Day of Defeat   2003-05-01        1             Valve   
3     40         Deathmatch Classic   2001-06-01        1             Valve   
4     50  Half-Life: Opposing Force   1999-11-01        1  Gearbox Software   

  publisher  required_age  genres                 steamspy_tags  achievements  \
0     Valve             0  Action        Action;FPS;Multiplayer             0   
1     Valve             0  Action        Action;FPS;Multiplayer             0   
2     Valve             0  Action  FPS;World War II;Multiplayer             0   
3     Valve             0  Action        Action;FPS;Multiplayer             0   
4     Valve             0  Action             FPS;Action;Sci-fi             0   

   ...  Stats  Steam Achievements  Steam Cloud  Steam Leaderboards  \
0  ...      0                   0            0                   0   
1  ...      0                   0            0                   0   
2  ...      0                   0            0                   0   
3  ...      0                   0            0                   0   
4  ...      0                   0            0                   0   

  Steam Trading Cards  Steam Turn Notifications  Steam Workshop  \
0                   0                         0               0   
1                   0                         0               0   
2                   0                         0               0   
3                   0                         0               0   
4                   0                         0               0   

   SteamVR Collectibles  VR Support  Valve Anti-Cheat enabled  
0                     0           0                         1  
1                     0           0                         1  
2                     0           0                         1  
3                     0           0                         1  
4                     0           0                         1  

[5 rows x 48 columns]

5. Answer the research questionΒΆ

5.1: Define and answer the research questionΒΆ

Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce at least two visualizations using the cleaned data and explain how they help you answer the question.

Research question:

  • Analyze trends in video game sales across different regions.
  • Examine user ratings, playtime statistics, and the relationship between game price and user feedback on Steam.
InΒ [10]:
# Filter vgsales_copy dataset to only include rows where Platform is 'PC'
vgsales_pc = vgsales_copy[vgsales_copy['Platform'] == 'PC']

# Inner join on 'name' from steam_copy and 'Name' from vgsales_copy
merged_df = pd.merge(steam_copy, vgsales_pc, left_on='name', right_on='Name', how='inner')

import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

# Group sales data by Publisher and region
publisher_sales_by_region = merged_df.groupby('Publisher')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()

# Find best performing publisher for each
best_publishers_na = publisher_sales_by_region['NA_Sales'].nlargest(5)
best_publishers_eu = publisher_sales_by_region['EU_Sales'].nlargest(5)
best_publishers_jp = publisher_sales_by_region['JP_Sales'].nlargest(5)
best_publishers_other = publisher_sales_by_region['Other_Sales'].nlargest(5)

# Plot bar charts for top publishers in each
fig_na = px.bar(best_publishers_na, x=best_publishers_na.index, y='NA_Sales',
                title="Top 5 Publishers in North America by Sales",
                labels={'NA_Sales': 'Sales in Millions'}, color='NA_Sales')

fig_eu = px.bar(best_publishers_eu, x=best_publishers_eu.index, y='EU_Sales',
                title="Top 5 Publishers in Europe by Sales",
                labels={'EU_Sales': 'Sales in Millions'}, color='EU_Sales')

fig_jp = px.bar(best_publishers_jp, x=best_publishers_jp.index, y='JP_Sales',
                title="Top 5 Publishers in Japan by Sales",
                labels={'JP_Sales': 'Sales in Millions'}, color='JP_Sales')

fig_other = px.bar(best_publishers_other, x=best_publishers_other.index, y='Other_Sales',
                   title="Top 5 Publishers in Other Regions by Sales",
                   labels={'Other_Sales': 'Sales in Millions'}, color='Other_Sales')

# Show plots
fig_na.show()
fig_eu.show()
fig_jp.show()
fig_other.show()

# Calculate total sales by region
region_sales = {
    'Region': ['North America', 'Europe', 'Japan', 'Other'],
    'Sales': [merged_df['NA_Sales'].sum(), 
              merged_df['EU_Sales'].sum(), 
              merged_df['JP_Sales'].sum(), 
              merged_df['Other_Sales'].sum()]
}

# Create a DataFrame from region sales data
region_sales_df = pd.DataFrame(region_sales)

# Pie chart showing total revenue by region
fig_pie_revenue_by_region = px.pie(region_sales_df, values='Sales', names='Region',
                                   title="Total Revenue Distribution by Region")

# Show pie chart
fig_pie_revenue_by_region.show()

The graphs above illustrate different publisher's success by region. While it appears that each region may have a preference for certain Publishers, this could also be due to other factors, such as regulations or markets that Publishers may or may not operate in. About half of all sales occur in North America, illustrating it to be the largest consuming market. However, this data could be misleading without considering overall populations and market percentage of each region.

InΒ [11]:
# Group by Publisher and sum positive and negative ratings
ratings_by_publisher = merged_df.groupby('Publisher')[['positive_ratings', 'negative_ratings']].sum().reset_index()

# Calculate ratio of positive to negative ratings, handle division by zero
ratings_by_publisher['rating_ratio'] = ratings_by_publisher['positive_ratings'] / ratings_by_publisher['negative_ratings'].replace(0, 1)

# Sort by ratio of positive to negative ratings
ratings_by_publisher = ratings_by_publisher.sort_values(by='rating_ratio', ascending=False)

# Plot bar chart for ratio of positive to negative ratings by publisher
fig_ratio = px.bar(ratings_by_publisher, x='Publisher', y='rating_ratio',
                   title="Positive to Negative Ratings Ratio by Publisher",
                   labels={'rating_ratio': 'Positive/Negative Rating Ratio', 'Publisher': 'Publisher'},
                   color='rating_ratio')

# Customize layout to avoid overlapping labels
fig_ratio.update_layout(xaxis_tickangle=-45, xaxis_title="Publisher", yaxis_title="Positive to Negative Rating Ratio")

# Show plot for positive-to-negative rating ratio
fig_ratio.show()

# Now, create a chart for revenue (global sales) of each publisher
# Group by Publisher and sum Global Sales (Revenue) column
revenue_by_publisher = vgsales_pc.groupby('Publisher')['Global_Sales'].sum().reset_index()

# Filter to include only publishers with over 0.75 million in revenue
revenue_by_publisher_filtered = revenue_by_publisher[revenue_by_publisher['Global_Sales'] > 0.75]

# Sort by total Global Sales
revenue_by_publisher_filtered = revenue_by_publisher_filtered.sort_values(by='Global_Sales', ascending=False)

# Plot bar chart for revenue by publisher
fig_revenue = px.bar(revenue_by_publisher_filtered, x='Publisher', y='Global_Sales',
                     title="Revenue (Global Sales) by Publisher (over 0.75 million)",
                     labels={'Global_Sales': 'Revenue (in Millions)', 'Publisher': 'Publisher'},
                     color='Global_Sales')

# Customize layout to avoid overlapping labels
fig_revenue.update_layout(xaxis_tickangle=-45, xaxis_title="Publisher", yaxis_title="Revenue (in Millions)")

# Show plot for revenue by publisher
fig_revenue.show()

# Now, let's add a graph for average revenue per game by publisher
# Group by Publisher, sum Global Sales, and count number of games
publisher_stats = vgsales_pc.groupby('Publisher').agg(
    total_revenue=('Global_Sales', 'sum'), 
    num_games=('Name', 'count')
).reset_index()

# Calculate average revenue per game
publisher_stats['avg_revenue_per_game'] = publisher_stats['total_revenue'] / publisher_stats['num_games']

# Filter publishers with average revenue per game greater than 0.3 million
publisher_stats_filtered = publisher_stats[publisher_stats['avg_revenue_per_game'] > 0.3]

# Sort by average revenue per game in descending order
publisher_stats_filtered = publisher_stats_filtered.sort_values(by='avg_revenue_per_game', ascending=False)

# Plot bar chart for average revenue per game by publisher
fig_avg_revenue = px.bar(
    publisher_stats_filtered, 
    x='Publisher', 
    y='avg_revenue_per_game',
    title="Average Revenue per Game by Publisher (over 0.3 million)",
    labels={'avg_revenue_per_game': 'Average Revenue per Game (in Millions)', 'Publisher': 'Publisher'},
    color='avg_revenue_per_game'
)

# Customize layout to avoid overlapping labels
fig_avg_revenue.update_layout(xaxis_tickangle=-45, xaxis_title="Publisher", yaxis_title="Average Revenue per Game (in Millions)")

# Show plot for average revenue per game by publisher
fig_avg_revenue.show()

# Optional: Log-log analysis for correlation between ratings and revenue as in previous code
# Apply log transformation to positive and negative ratings and global sales
merged_df['log_positive_ratings'] = np.log1p(merged_df['positive_ratings'])
merged_df['log_negative_ratings'] = np.log1p(merged_df['negative_ratings'])
merged_df['log_global_sales'] = np.log1p(merged_df['Global_Sales'])

# Scatter Plot 1: Log of Positive Ratings vs Log of Global Sales
fig_loglog_positive_sales = px.scatter(merged_df, x='log_positive_ratings', y='log_global_sales',
                                       title="Log-Log Correlation Between Positive Ratings and Global Sales",
                                       labels={'log_positive_ratings': 'Log of Positive Ratings', 'log_global_sales': 'Log of Revenue'},
                                       trendline="ols",  # Add a trendline to visualize correlation
                                       color='log_positive_ratings')

# Show log-log scatter plot with a trendline for Positive Ratings vs Sales
fig_loglog_positive_sales.show()

# Scatter Plot 2: Log of Negative Ratings vs Log of Global Sales
fig_loglog_negative_sales = px.scatter(merged_df, x='log_negative_ratings', y='log_global_sales',
                                       title="Log-Log Correlation Between Negative Ratings and Global Sales",
                                       labels={'log_negative_ratings': 'Log of Negative Ratings', 'log_global_sales': 'Log of Revenue'},
                                       trendline="ols",  # Add a trendline to visualize correlation
                                       color='log_negative_ratings')

# Show log-log scatter plot with a trendline for Negative Ratings vs Sales
fig_loglog_negative_sales.show()

# Calculate and print correlation coefficients for log-log values
correlation_loglog_positive = merged_df[['log_positive_ratings', 'log_global_sales']].corr().iloc[0, 1]
correlation_loglog_negative = merged_df[['log_negative_ratings', 'log_global_sales']].corr().iloc[0, 1]
print(f"Correlation between log of positive ratings and log of revenue: {correlation_loglog_positive:.2f}")
print(f"Correlation between log of negative ratings and log of revenue: {correlation_loglog_negative:.2f}")

# Group by Publisher, sum Global Sales, and count number of games
publisher_stats = vgsales_pc.groupby('Publisher').agg(
    total_revenue=('Global_Sales', 'sum'), 
    num_games=('Name', 'count')
).reset_index()

# Calculate average revenue per game
publisher_stats['avg_revenue_per_game'] = publisher_stats['total_revenue'] / publisher_stats['num_games']

# Group by Publisher and sum positive and negative ratings
ratings_by_publisher = merged_df.groupby('Publisher')[['positive_ratings', 'negative_ratings']].sum().reset_index()

# Calculate ratio of positive to negative ratings, handling division by zero
ratings_by_publisher['rating_ratio'] = ratings_by_publisher['positive_ratings'] / ratings_by_publisher['negative_ratings'].replace(0, 1)

# Merge publisher stats with ratings data
merged_publisher_stats = pd.merge(publisher_stats, ratings_by_publisher, on='Publisher')

# Apply a logarithmic transformation to positive to negative rating ratio
merged_publisher_stats['log_rating_ratio'] = np.log1p(merged_publisher_stats['rating_ratio'])

# Scatter plot: Average revenue per game vs. Log of Positive to Negative rating ratio
fig_scatter = px.scatter(
    merged_publisher_stats, 
    x='log_rating_ratio', 
    y='avg_revenue_per_game',
    title="Average Revenue per Game vs Log of Positive to Negative Rating Ratio by Publisher",
    labels={'log_rating_ratio': 'Log of Positive/Negative Rating Ratio', 'avg_revenue_per_game': 'Average Revenue per Game (in Millions)'},
    color='avg_revenue_per_game',  # Color by average revenue
    hover_data=['Publisher'],  # Show publisher info on hover
    trendline='ols'  # Add a trendline for linear regression
)

# Show scatter plot
fig_scatter.show()

# Calculate RΒ² value using OLS (Ordinary Least Squares) trendline
results = px.get_trendline_results(fig_scatter)
ols_result = results.iloc[0]["px_fit_results"]

# Get RΒ² value from OLS result
r_squared = ols_result.rsquared
print(f"RΒ² value: {r_squared:.4f}")

# Calculate positive to negative rating ratio
merged_df['rating_ratio'] = merged_df['positive_ratings'] / merged_df['negative_ratings'].replace(0, 1)

# Filter necessary columns (Revenue, Time Played, Rating Ratio)
correlation_data = merged_df[['Global_Sales', 'average_playtime', 'rating_ratio']]

# Calculate correlation matrix
correlation_matrix = correlation_data.corr()

# Plot correlation heatmap using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix: Revenue vs Time Played vs Rating Ratio")
plt.show()

# Group vgsales_copy dataset by Genre and sum Global Sales
sales_by_genre = vgsales_copy.groupby('Genre')['Global_Sales'].sum().reset_index()

# Sort genres by total global sales in descending order
sales_by_genre = sales_by_genre.sort_values(by='Global_Sales', ascending=False)

# Plot bar chart for total sales by genre
fig_genre_sales = px.bar(
    sales_by_genre, 
    x='Genre', 
    y='Global_Sales', 
    title="Total Global Sales by Genre",
    labels={'Global_Sales': 'Total Sales (in Millions)', 'Genre': 'Genre'},
    color='Global_Sales'
)

# Customize layout to avoid overlapping labels
fig_genre_sales.update_layout(xaxis_tickangle=-45)

# Show plot
fig_genre_sales.show()

# Group the vgsales_copy dataset by Genre, calculate the total sales and count the number of games
genre_stats = vgsales_copy.groupby('Genre').agg(
    total_sales=('Global_Sales', 'sum'),
    num_games=('Name', 'count')
).reset_index()

# Calculate the average revenue per game for each genre
genre_stats['avg_revenue_per_game'] = genre_stats['total_sales'] / genre_stats['num_games']

# Sort the genres by average revenue per game in descending order
genre_stats = genre_stats.sort_values(by='avg_revenue_per_game', ascending=False)

# Plot the bar chart for average revenue per game by genre
fig_avg_revenue_genre = px.bar(
    genre_stats, 
    x='Genre', 
    y='avg_revenue_per_game', 
    title="Average Revenue per Game by Genre",
    labels={'avg_revenue_per_game': 'Average Revenue per Game (in Millions)', 'Genre': 'Genre'},
    color='avg_revenue_per_game'
)

# Customize the layout to avoid overlapping labels
fig_avg_revenue_genre.update_layout(xaxis_tickangle=-45)

# Show the plot
fig_avg_revenue_genre.show()
Correlation between log of positive ratings and log of revenue: 0.48
Correlation between log of negative ratings and log of revenue: 0.42
RΒ² value: 0.0655
No description has been provided for this image

From the graphs above, there are no strong correlations. It is difficult to confirm without additional data, but it would appear that stronger predictors are Customer Loyalty and Genre Preference. Both of these may even be tied back to the Publishing studio. Platform and Shooter genres seem to perform well, but this could also be skewed due to Customer Loyalty.

5.2: ReflectionΒΆ

In 2-4 sentences, if you had more time to complete the project, what actions would you take? For example, which data quality and structural issues would you look into further, and what research questions would you further explore?

Answer: I would seek additional datasets, including populations and surveys, to better understand the demographics of gamers and how they influence game sales across regions. I would also explore deeper data quality issues, such as missing or incomplete data for specific games and genres, and ensure that outliers in sales or playtime are properly handled. Additionally, I would investigate other factors influencing sales, such as marketing spend, game reviews, and platform popularity, to gain a comprehensive understanding of what drives revenue.